A Hamden resident who recently purchased his house saw its assessed value increase significantly within a year of the purchase, and he began to question whether the assessment was fair. Our goal is to determine whether the assessed values for residential properties in Hamden, CT are fair and whether there is any evidence to support contesting an assessed value.
To answer this question, we used 2009 and 2024 Hamden sales to build a sale price model, then applied it to all properties to estimate what each home in Hamden would have sold for in 2024. By comparing our predicted 2024 sale price to the actual assessments in 2023, we determined whether the assessor’s values align with our estimate of value and are therefore “fair.”
First, we read in the sales data, CT property data, and Hamden HTML files. We filtered the sales data and CT property data to include only single-family homes in Hamden. We then extracted the sale date and sale price from the HTML files and merged them with the sales data. Next, we read in the 2009 data and extracted the 2009 sale price from the HTML files. Finally, we combined all of this data into one dataset for analysis (with 2009 and 2024 sales in separate rows) and added centroid/location data.
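# load the packages used below
library(readxl)   # read_excel
library(readr)    # read_csv
library(dplyr)    # filter, rename, joins
library(tidyr)    # pivot_wider
library(stringr)  # str_split, str_replace_all
library(rvest)    # read_html, html_elements, html_table
# read in sales data
sale_data <- read_excel("sales_data.xlsx")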
sale_data = sale_data %>%
filter(Description == 'Single Fam M01') %>%
rename(
Sale.Price.2024 = 'Sale Price',
Sale.Date.2024 = 'Sale Date'
)
# read in CT property
hamden_data <- read_csv("Connecticut_CAMA_and_Parcel_Layer_3895368049124948111.csv")
hamden_data = hamden_data %>%
filter(Town_Name == "Hamden", Unit_Type == "Single Fam M01")
# read in Hamden html files
for(i in 1:nrow(hamden_data)){
hamden_data$PID[i] <- str_split(hamden_data$link[i],"-")[[1]][2]
}
hamden_data$PID <- as.numeric(hamden_data$PID)
hamden_data$Sale.Date <- NA
hamden_data$Sale.Price <- NA
for(i in 1:nrow(hamden_data)){
temp_html <- read_html(paste0('Hamden_Sept2025/',hamden_data$PID[i],'.html'))
tables <- temp_html %>%
html_elements("table") %>%
html_table(fill = TRUE)
if (length(tables) >= 5) {
sale_table <- tables[[5]]
sale_table <- sale_table[!sale_table$X1 == "", ]
sale_table <- sale_table %>%
pivot_wider(names_from = X1, values_from = X2)
if ("Sale Date" %in% names(sale_table)) {
hamden_data$Sale.Date[i] <- sale_table$`Sale Date`[1]
}
if ("Sale Price" %in% names(sale_table)) {
hamden_data$Sale.Price[i] <- sale_table$`Sale Price`[1]
}
val_table <- tables[[3]]
hamden_data$`Current Year Assessment`[i] <- val_table$Total
hamden_data$`Current Year Assessment`[i] <- str_replace_all(hamden_data$`Current Year Assessment`[i], '[\\$,]', '')
}
}
hamden_data <- hamden_data %>%
dplyr::select(-'Sale Date', -'Sale Price')
# read in 2009 data
d2009 <- read_csv("Hamden2009.csv")
d2009 = d2009 %>% rename('PID' = 'pid','Number.of.Bedroom' = 'bedrooms', 'Number.of.Baths' = 'bathrooms', 'Number.of.Half.Baths' = 'halfbaths', 'Living.Area' = 'livingarea', 'Land.Acres' = 'landsize', 'Current Year Assessment' = 'assessedvalue', 'Total.Rooms' = 'totalrooms', 'ayb' = 'yearbuilt')
d2009$Sale <- rep(NA, nrow(d2009))
d2009$Year <- rep(2009, nrow(d2009))
hamden_data$Year <- rep(2024, nrow(hamden_data))
# left join
sale_data$Location <- paste(sale_data$`Property Number`, sale_data$`Street Name`)
merged <- left_join(hamden_data, sale_data, by = "Location")
merged = merged %>% rename('Sale' = 'Sale.Price.2024')
colnames(merged) <- gsub(" ", ".", colnames(merged))
for(i in seq_len(nrow(d2009))){
# read_html() errors on a missing file rather than returning NULL, so check first
html_path <- paste0('Hamden_Sept2025/',d2009$PID[i],'.html')
if(!file.exists(html_path)){
next
}
temp_html <- read_html(html_path)
tables <- temp_html %>%
html_elements("table") %>%
html_table(fill = TRUE)
if(length(tables) < 6){
next
}
sale_table <- tables[[6]]
# the sale year is the third "/"-separated piece of the sale date
sale_year <- str_split_fixed(sale_table$`Sale Date`, '/', 3)[, 3]
# strip "$" and "," and keep the largest positive 2009 sale price, if any
price_2009 <- as.numeric(str_replace_all(sale_table$`Sale Price`[which(sale_year == 2009)], '[\\$,]', ''))
price_2009 <- price_2009[!is.na(price_2009) & price_2009 > 0]
if(length(price_2009) > 0) {
d2009$Sale[i] <- max(price_2009)
} else {
d2009$Sale[i] <- NA
}
}
merged <- merged %>% rename(PID = PID.x)
merged <- merged %>% rename(`Current Year Assessment` = `Current.Year.Assessment.x`)
same_cols <- intersect(colnames(merged), colnames(d2009))
merged <- rbind(subset(merged, select = same_cols), subset(d2009, select = same_cols))
merged <- merged %>% group_by(PID) %>% filter(n() > 1) %>% ungroup()
merged <- merged %>% arrange(PID)
# add in centroid/location data
centroids <- readRDS("centroids.no.geometry.rds")
centroids <- centroids[centroids$Town_Name == 'HAMDEN', ]
for(i in 1:nrow(centroids)){
centroids$PID[i] <- str_split(centroids$Link[i],"-")[[1]][2]
}
centroids$PID <- as.numeric(centroids$PID)
model_data <- merged %>%
left_join(centroids, by = c("PID" = "PID"))
# write csv
write.csv(model_data, "model_data.csv", row.names = FALSE)
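The price and date cleaning done inside the loops above can be illustrated on a small example. This is a minimal base R sketch, assuming sale dates are formatted as month/day/year and prices as strings such as "$215,000"; the `sale_table` values here are made up for illustration and do not come from the Hamden files.

```r
# Hypothetical sale history for one property (values invented for illustration).
sale_table <- data.frame(
  `Sale Date`  = c("06/15/2009", "03/02/2015", "11/30/2009"),
  `Sale Price` = c("$215,000", "$250,000", "$0"),
  check.names = FALSE
)

# Remove "$" and "," so the price can be converted to a number.
price <- as.numeric(gsub("[$,]", "", sale_table$`Sale Price`))

# Pull the sale year out of the date (assumes month/day/year format).
sale_year <- as.integer(format(as.Date(sale_table$`Sale Date`, format = "%m/%d/%Y"), "%Y"))

# Keep only positive 2009 prices; $0 transfers are not market sales.
p2009 <- price[sale_year == 2009 & price > 0]

# Use the largest 2009 price if there is one, otherwise NA.
sale_2009 <- if (length(p2009) > 0) max(p2009) else NA_real_
print(sale_2009)
```

Filtering out non-positive prices before calling `max()` avoids the warning that `max()` gives on an empty vector.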
First, we looked at the Assessed Value/Sale Price ratio (ASR) for homes sold in 2024 to understand if assessed value is considered “fair” in 2024. The histogram and scatterplot below show the distribution of ASR for homes sold in 2024, comparing with 2024 assessed values.
The median ASR for homes sold in 2024 is 0.6602581, slightly below 0.7, the statutory ratio of assessed value to fair market value in Connecticut.
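As a minimal sketch of the ASR calculation described above, the example below computes the ratio on a few invented properties and compares the median to the 0.70 statutory ratio. The data frame `toy` and its column names are hypothetical and are not the columns used in our dataset.

```r
# Hypothetical assessed values and 2024 sale prices for five homes.
toy <- data.frame(
  assessed = c(210000, 180000, 330000, 150000, 275000),
  sale     = c(320000, 300000, 450000, 200000, 410000)
)

# ASR = assessed value / sale price, one value per property.
toy$ASR <- toy$assessed / toy$sale

# Median ASR and its distance from the statutory ratio.
statutory_ratio <- 0.70
median_asr <- median(toy$ASR)
gap <- median_asr - statutory_ratio

# Share of properties assessed below the statutory ratio.
share_below <- mean(toy$ASR < statutory_ratio)

print(round(c(median = median_asr, gap = gap, share_below = share_below), 3))
```

A negative `gap` means the typical home is assessed below 70% of its sale price.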
From the Leaflet map, we see that almost all ASRs for homes sold in 2024 are below 0.7. The lowest ASRs are in the northeast part of Hamden, while the highest ASRs are in the southeast. ASR appears more variable in southern Hamden, especially compared to the northwest, where most homes have ASRs near 0.7 and are therefore the most accurately assessed. The second Leaflet map plots the 2024 sale price of homes sold in 2024. The more expensive homes are in the northwest and southeast. It is difficult to tell whether there is a clear link between ASR and home price, but ASRs may be higher and more variable in these areas, suggesting that the assessor may be assessing more expensive homes less consistently.
## Warning: NAs introduced by coercion
## [1] 0.6602581
Next we performed correlation analysis to identify which property features to consider including in our model predicting sale price in 2024. We see that Living Area, Number of Baths, Total Rooms, and Year all have the strongest correlations with Sale Price. However, there appear to be correlations between these potential predictor variables, and we therefore need to be careful of collinearity when building our model. For example, Living Area is highly correlated with Number of Baths and Total Rooms.
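The screening step described above can be sketched as follows. This uses simulated data, not the Hamden data, and the 0.7 cutoff for flagging collinear predictor pairs is an illustrative assumption rather than a threshold we applied.

```r
set.seed(1)
n <- 200

# Simulated features in which rooms and baths both scale with living area.
living_area <- rnorm(n, mean = 1800, sd = 500)
toy <- data.frame(
  Sale            = 150 * living_area + rnorm(n, sd = 40000),
  Living.Area     = living_area,
  Total.Rooms     = round(living_area / 300 + rnorm(n, sd = 0.8)),
  Number.of.Baths = round(living_area / 900 + rnorm(n, sd = 0.4), 1),
  Land.Acres      = rexp(n, rate = 3)
)

# Pairwise Pearson correlations between all columns.
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each candidate predictor with sale price, strongest first.
with_sale <- sort(cors["Sale", colnames(cors) != "Sale"], decreasing = TRUE)

# Predictor pairs whose absolute correlation exceeds the (assumed) cutoff.
pred_cors <- cors[rownames(cors) != "Sale", colnames(cors) != "Sale"]
idx <- which(abs(pred_cors) > 0.7 & upper.tri(pred_cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(pred_cors)[idx[, "row"]],
  var2 = colnames(pred_cors)[idx[, "col"]],
  r    = round(pred_cors[idx], 2)
)

print(round(with_sale, 2))
print(high_pairs)
```

Pairs listed in `high_pairs` are candidates for keeping only one of the two variables in the model.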
We also looked at scatterplots of all potential variables with Sale Price in order to see if variables meet the linear relationship assumption for linear regression. There appear to be linear relationships between Sale Price and Living Area, Number of Baths, and Total Rooms. However, there may be nonlinear relationships between Sale Price and Land Acres and Year. In terms of potential transformations, we see that Sale Price is right-skewed and may benefit from a log transformation.
Next, we built a predictive model using properties that actually sold. We performed stepwise regression using AIC, with 10-fold cross-validation, to select the best model. The final model, shown in the summary output below, predicts log(Sale) from Living Area, Year, Total Rooms, and Number of Bedrooms. Based on cross-validation, there is no sign of significant overfitting: our full-sample RMSE (0.1700736) is similar to the average cross-validated RMSE (0.1728461), both measured on the log sale price scale.
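The selection and validation procedure can be sketched in base R. This is not the code that produced our output: it uses simulated data, `step()` for AIC-based stepwise selection, and a hand-written 10-fold loop that repeats the selection inside each fold. The data frame `toy` and the irrelevant predictor `Noise` are hypothetical.

```r
set.seed(4250)
n <- 300

# Simulated data: log price depends on living area and rooms; Noise is irrelevant.
toy <- data.frame(
  Living.Area = rnorm(n, mean = 1800, sd = 500),
  Total.Rooms = sample(4:10, n, replace = TRUE),
  Noise       = rnorm(n)
)
toy$Sale <- exp(11 + 0.0003 * toy$Living.Area + 0.02 * toy$Total.Rooms +
                  rnorm(n, sd = 0.17))

# Candidate predictors that stepwise selection may add or drop.
full_scope <- ~ Living.Area + Total.Rooms + Noise

# Stepwise selection by AIC on the full sample, starting from the intercept-only model.
fit <- step(lm(log(Sale) ~ 1, data = toy), scope = full_scope,
            direction = "both", trace = 0)
rmse_full <- sqrt(mean(residuals(fit)^2))  # in-sample RMSE on the log scale

# 10-fold cross-validation, redoing the selection within each training fold.
k <- 10
fold <- sample(rep(1:k, length.out = n))
sq_err <- numeric(n)
for (f in 1:k) {
  train <- toy[fold != f, ]
  test  <- toy[fold == f, ]
  sel <- step(lm(log(Sale) ~ 1, data = train), scope = full_scope,
              direction = "both", trace = 0)
  sq_err[fold == f] <- (log(test$Sale) - predict(sel, newdata = test))^2
}
rmse_cv <- sqrt(mean(sq_err))  # out-of-sample RMSE on the log scale

print(formula(fit))
print(round(c(full_sample = rmse_full, cross_validated = rmse_cv), 4))
```

A cross-validated RMSE close to the full-sample RMSE is what we mean by no sign of significant overfitting. Repeating the selection in each fold lets the variable choice itself be validated, which is why the selected variables can differ from fold to fold.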
Finally, the residual plots below show that the linear model assumptions are reasonably satisfied. The residuals appear to be normally distributed and homoskedastic, and there is no obvious pattern in the residuals vs. fitted values plot. However, there appears to be one extreme outlier with high leverage.
We plotted our predicted versus actual sale prices in 2009 and 2024 to see how well our model performed. The plot shows that our predicted sale prices are generally close to the actual sale prices, with most points falling near the 45-degree line. However, we can see that for higher sale prices, our model tends to underpredict the sale price. We then applied the final model to properties that did not sell in 2024 to predict what those properties would have sold for in 2024.
##
## Call:
## lm(formula = log(Sale) ~ Living.Area + Year + Total.Rooms + Number.of.Bedroom,
## data = train_model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83721 -0.10596 0.00841 0.10635 0.56936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.902e+01 1.923e+00 -20.295 < 2e-16 ***
## Living.Area 2.779e-04 1.569e-05 17.711 < 2e-16 ***
## Year 2.535e-02 9.538e-04 26.579 < 2e-16 ***
## Total.Rooms 2.490e-02 9.614e-03 2.590 0.00983 **
## Number.of.Bedroom -2.597e-02 1.439e-02 -1.804 0.07174 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1707 on 570 degrees of freedom
## Multiple R-squared: 0.7432, Adjusted R-squared: 0.7414
## F-statistic: 412.3 on 4 and 570 DF, p-value: < 2.2e-16
## R RNG seed set to 4250
## 10-Fold Cross Validation
## criterion: mse
## cross-validation criterion = 0.02988069
## bias-adjusted cross-validation criterion = 0.02980547
## 95% CI for bias-adjusted CV criterion = (0.02526121, 0.03434973)
## full-sample criterion = 0.0288893
## CV criterion by folds:
## fold.1 fold.2 fold.3 fold.4 fold.5 fold.6 fold.7
## 0.03602737 0.04273959 0.02917213 0.02805398 0.03995698 0.02700037 0.02851028
## fold.8 fold.9 fold.10
## 0.01910643 0.02402297 0.02375112
##
## Coefficients by folds:
## (Intercept) Living.Area Year Total.Rooms Number.of.Bedroom
## Fold 1 -3.90e+01 2.96e-04 2.53e-02 2.66e-02 -3.63e-02
## Fold 2 -3.82e+01 2.60e-04 2.49e-02 2.62e-02 -2.64e-02
## Fold 3 -3.88e+01 2.82e-04 2.53e-02 2.35e-02 -2.55e-02
## Fold 4 -3.95e+01 2.83e-04 2.56e-02
## Fold 5 -3.94e+01 2.78e-04 2.55e-02 1.71e-02
## Fold 6 -3.82e+01 2.75e-04 2.49e-02 1.41e-02
## Fold 7 -3.98e+01 2.72e-04 2.57e-02 2.97e-02 -3.26e-02
## Fold 8 -3.85e+01 2.76e-04 2.51e-02 2.55e-02 -2.76e-02
## Fold 9 -3.90e+01 2.71e-04 2.53e-02 2.94e-02 -2.98e-02
## Fold 10 -3.91e+01 2.76e-04 2.54e-02 2.52e-02 -2.67e-02
## Number.of.Baths
## Fold 1
## Fold 2 0.03
## Fold 3
## Fold 4 0.03
## Fold 5
## Fold 6
## Fold 7
## Fold 8
## Fold 9
## Fold 10
fit <- m.select
pred_sale <- predict(fit, newdata = val_model_data)
val_data$predSale <- exp(pred_sale)
ggplot(data = val_data, aes(x = Sale, y = predSale)) +
geom_point(color = "blue", alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Predicted vs Actual Sale Price",
x = "Actual Sale Price",
y = "Predicted Sale Price")
The plot of Sale Price and Assessed Value vs. Predicted Sale of properties in 2024 shows Appraised Value in red and Actual Sale price in blue. The dashed line represents where the predicted price equals the actual value. Both the assessed value and actual sale price generally match our predicted price (except at high predicted sale prices), indicating that the assessor’s values are generally fair. This is further supported by our graphs of Predicted Sale Price vs Appraised Value and Appraised Value vs Actual Sale Price, which show that both the assessed values and actual sale prices are generally close to our predicted sale prices and each other.
Looking at the variables in our model, we see that Living Area, Year, and Total Rooms are all significant predictors of sale price. These variables are likely included in the assessor’s model as well. However, we are most likely missing some important variables that the assessor uses, such as neighborhood effects or recent renovations, which could explain why our model does not perfectly predict sale prices. If we had more variables to include in our model (i.e. values that were present in only one of 2009 or 2024 but not both), we might have been able to build a model closer to the assessor’s and to assess the fairness of the variables the assessor included.
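Because the model predicts log(Sale), each coefficient translates into an approximate percentage change in price rather than a dollar amount. The sketch below copies two coefficients from the model summary and converts them; the helper `pct_change` is a hypothetical name, and the units of Living.Area (presumably square feet) are an assumption.

```r
# Coefficients copied from the model summary (response is log(Sale)).
b_living_area <- 2.779e-04
b_year        <- 2.535e-02

# Proportional change in predicted sale price when a predictor changes by
# `delta`, holding the other predictors fixed: exp(b * delta) - 1.
pct_change <- function(b, delta) exp(b * delta) - 1

effect_100_area <- pct_change(b_living_area, 100)     # 100 more units of Living.Area
effect_15_years <- pct_change(b_year, 2024 - 2009)    # a 2024 sale vs. a 2009 sale

print(round(100 * c(area_100 = effect_100_area, years_15 = effect_15_years), 1))
```

Under this model, 100 additional units of living area correspond to a predicted price about 2.8% higher, and an otherwise identical home is predicted to sell for roughly 46% more in 2024 than in 2009.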
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12793 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).
The analysis below uses the properties that did not sell in 2024, and therefore we do not have actual sale prices to compare to assessed values. However, we can still analyze the ASR distribution and residuals from our model to assess fairness of assessed values.
Using the most recent assessed value for each home and the 2024 sale price predicted by our model, ASR is generally less than 0.5, well below the 0.7 standard. Several properties have ASRs above 1.0, meaning that either our model is underestimating the value of these homes or the assessed value is too high. Because homes with higher sale prices seem to have larger residuals, we also calculated a normalized residual: the residual divided by the appraised value.
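The two quantities used in this part of the analysis can be sketched on invented numbers. The residual is taken here as predicted sale price minus appraised value; that sign convention is an assumption made for illustration, and the column names in `toy` are hypothetical.

```r
# Hypothetical unsold properties with appraised, assessed, and predicted values.
toy <- data.frame(
  appraised = c(300000, 450000, 800000),
  assessed  = c(210000, 315000, 560000),
  pred_sale = c(420000, 500000, 700000)
)

# ASR for unsold homes uses the model's predicted 2024 sale price as the denominator.
toy$ASR <- toy$assessed / toy$pred_sale

# Residual (assumed sign convention: predicted sale price minus appraised value).
toy$residual <- toy$pred_sale - toy$appraised

# Normalized residual: the residual as a fraction of the appraised value,
# so that expensive and inexpensive homes are on a comparable scale.
toy$norm_residual <- toy$residual / toy$appraised

print(toy)
```

Dividing by the appraised value keeps a $100,000 gap on an $800,000 home from looking as severe as the same gap on a $300,000 home.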
We plotted the ASR for each property in Hamden on a map, using the property’s latitude and longitude to plot. Each property is displayed as a point, with the color indicating the ASR value. Properties with high ASRs (over-assessed) are red, and those with low ASRs (under-assessed) are blue. This helps us determine whether location plays a role in fairness of value.
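A map of this kind can be sketched as follows. The coordinates and ASR values are invented, the three color bins around 0.70 are an illustrative assumption rather than the scale used in our maps, and the map is only built if the leaflet package is installed.

```r
# Hypothetical properties with coordinates and ASR values.
toy <- data.frame(
  lat = c(41.38, 41.40, 41.35, 41.42),
  lon = c(-72.90, -72.92, -72.91, -72.93),
  ASR = c(0.45, 0.70, 1.05, 0.62)
)

# Bin ASR around the 0.70 statutory ratio (cutoffs are an assumption).
toy$class <- cut(toy$ASR, breaks = c(-Inf, 0.65, 0.75, Inf),
                 labels = c("under", "near 0.70", "over"))

# Blue for under-assessed, grey for near 0.70, red for over-assessed.
cols <- c("under" = "blue", "near 0.70" = "grey", "over" = "red")
toy$color <- unname(cols[as.character(toy$class)])

# Build the interactive map only when leaflet is available.
if (requireNamespace("leaflet", quietly = TRUE)) {
  asr_map <- leaflet::leaflet(toy)
  asr_map <- leaflet::addTiles(asr_map)
  asr_map <- leaflet::addCircleMarkers(
    asr_map, lng = ~lon, lat = ~lat, color = ~color,
    radius = 4, stroke = FALSE, fillOpacity = 0.8,
    popup = ~paste0("ASR: ", round(ASR, 2))
  )
  # print(asr_map) displays the map in an interactive session
}

print(toy[, c("ASR", "class", "color")])
```

Spatial clusters of a single color would suggest that location is related to how a property is assessed.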
The Leaflet plot shows ASRs are the lowest for homes in the southern center region. Overall, ASR is well below .7 and fairly consistently blue across the city.
The second Leaflet plots normalized residuals. There is a clear region of large, positive residuals in the bottom center of Hamden. There also seems to be a cluster of negative residuals in the Northwest region, which is where more expensive homes are. This area also had higher ASRs. We think that our model may be underestimating sales price for more expensive homes, and overestimating sales price for less expensive homes.
## `geom_smooth()` using formula = 'y ~ x'
Our analysis shows that assessed value in 2024 is generally fair. Our predictive model for sale price in 2024 indicates that assessed values generally align with our predicted prices and actual sale prices, suggesting that the assessor’s values are reasonable.
Through our model, we found that certain property features, such as Living Area and Total Rooms, are strong predictors of sale price and are likely included in the assessor’s model. The residual analysis indicates that the model assumptions are reasonably satisfied, and there is no significant overfitting based on cross-validation results.
However, there are limitations to our analysis. Our model is based on the assumption that the assessor uses a similar set of variables and a linear regression approach, which may not be the case. This model does not account for all possible factors that could influence property values, such as neighborhood effects or recent renovations. For example, we see that our model may be underestimating sale prices for more expensive homes, as indicated by the positive correlation between sale price and residuals. This suggests that the assessor may be using additional variables or a different modeling approach for higher-value properties.
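One caution when reading this diagnostic: in an ordinary least squares fit with an intercept, residuals are always positively correlated with the actual response, even when the model is correctly specified, while their correlation with the fitted values is zero. The simulated sketch below illustrates this; it does not use our data, and it suggests that a residuals vs. predicted price plot is the more reliable check for underprediction of expensive homes.

```r
set.seed(2)
n <- 500

# A correctly specified linear model: y really is linear in x.
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)
res <- residuals(fit)

# Correlation of residuals with the actual response: positive by construction,
# equal to sqrt(1 - R^2) for least squares with an intercept.
cor_actual <- cor(y, res)
r2 <- summary(fit)$r.squared

# Correlation of residuals with the fitted values: zero up to rounding error.
cor_fitted <- cor(fitted(fit), res)

print(round(c(cor_actual = cor_actual, sqrt_1_minus_R2 = sqrt(1 - r2),
              cor_fitted = cor_fitted), 4))
```

A trend in residuals against predicted price, rather than against actual price, would be stronger evidence that the model treats higher-value properties differently.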